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Abstract 

New high-throughput sequencing technologies have made it possible to pursue the advent of genome- 
wide transcriptomics. That progress combined with the recent discovery of regulatory non-coding 
RNAs (ncRNAs) has necessitated fast and accurate algorithms to predict RNA-RNA interaction 
probability and structure. Although there are algorithms to predict minimum free energy interac- 
tion secondary structure for two nucleic acids, little work has been done to exploit the information 
invested in the base pair probabilities to improve interaction structure prediction. In this paper, we 
present an algorithm to predict the Hamming centroid of the Boltzmann ensemble of interaction 
structures. We also present an efficient algorithm to sample interaction structures from the ensem- 
ble. Our sampling algorithm uses a balanced scheme for traversing indices which improves the 
running time of the Ding-Lawrence sampling algorithm. The Ding-Lawrence sampling algorithm 
has 0(n 2 m 2 ) time complexity whereas our algorithm has 0((n + m) 2 log(« + m)) time complexity, 
in which n and m are the lengths of input strands. We implemented our algorithm in a new version of 
piRNA ITOl and compared our structure prediction results with competitors. Our centroid prediction 
outperforms competitor minimum-free-energy prediction algorithms on average. 

1 Introduction 

The advent of genome-wide transcriptomics using high-throughput sequencing technologies and the 
recent discovery of regulatory non-coding RNAs (ncRNAs) have made it clear that RNA plays a large 
variety of important roles in living organisms that are more complex than being a mere intermediate in 
protein biosynthesis. A large portion of these ncRNAs regulate gene expression post-transcriptionally 
through binding and forming base pairs (and establishing a joint structure) with a target mRNA, like 
micro RNAs and small interfering RNAs (siRNAs) BOTH 02, antisense RNAs [6l[38l or bacterial small 
regulatory RNAs (sRNAs) |[T8l . In addition, antisense oligonucleotides have been used as exogenous 
regulators of gene expression, usually to knock out genes for bacterial studies. Antisense technology is 
now commonly used as both a research tool and for therapeutic purposes. Synthetic nucleic acids have 
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also been engineered to self assemble and interact in essentially nucleic acid machines P4l [331 l36l l37l 

ED- 

A key tool in all the above advances is an accurate tractable algorithm to predict the structure and 
base pairing probabilities between candidate regulatory ncRNAs and their potential targets. There are 
algorithms for predicting the most likely (the lowest total free energy) joint structure that can be formed 
by two interacting RNA strands fT|. Also, recently powerful algorithms for computing the partition 
function of interacting nucleic acid strands have been given (see our previous work |[T0l . for example, 
or II201 ). An important direction that has not been explored is to use the information invested in base 
pair probabilities to improve the accuracy of interaction structure prediction algorithms. In particular, 
Ding et al. take this promising direction but for prediction of the structure of a single nucleic acid strand 
lTT2l . In this paper, we aim to improve the accuracy of interaction structure prediction by centroids in 
the Boltzmann ensemble. 

In this paper, we present an algorithm to predict the Hamming centroid of an ensemble that is 
composed of the type of interactions that Alkan et al. HI considered. We also present an efficient 
algorithm to sample interaction structures from the ensemble. Similar to the approach of |[T3l . sampled 
structures are clustered and the centroids of the clusters are considered as candidate structures. We 
believe success of such an approach critically depends on the clustering method, therefore, we leave 
sampling-clustering algorithms for a separate study. Our sampling algorithm uses a balanced scheme 
for traversing indices which improves the worst case running time complexity of the Ding-Lawrence 
sampling algorithm from 0(n 2 m 2 ) to 0((n+m) log (n+m)), in which n and m are the lengths of input 
strands. We implemented our algorithm in a new version of pi RNA |[T0l and compared our structure 
prediction results with those of inteRNA (H and the software of Kato et al. j2ll . Our centroid prediction 
outperforms competitor minimum-free-energy prediction algorithms in most of the experiments and on 
average. 

Computational prediction of RNA secondary structure 

Several computational methods have emerged to study the secondary structure thermodynamics of a 
single nucleic acid strand. In the core of most methods lie a complete or variant of the Nearest Neighbor 
Thermodynamic energy model for a nucleic acid secondary structure ll24l . That model is widely consid- 
ered the standard energy model. It is based on an (almost) log-linear Boltzmann probability distribution 
founded on the assumption that stacking base pairs and loop entropies contribute additively to the free 
energy of a nucleic acid secondary structure. The standard energy model has been extended for pseu- 
doknots and RNA-RNA interaction OCL01CI71- Exploiting the additivity of the energy, efficient divide 
and conquer algorithms for predicting the minimum free energy secondary structure Il27ll40ll43ll33l and 
computing the partition function of a single strand |[25l[T7l have been developed. Also, Ding et al. give 
algorithms to predict the centroid of the Boltzmann ensemble and to sample structures from it lfT2l[T4l . 
Ponty provides a new sampling algorithm, based on a balanced traversal of indices, whose worst case 
running time complexity is 0{n\ogn) QUI . Ponty 's algorithm improves the running time complexity of 
the Ding-Lawrence algorithm, which is 0(n 2 ). 
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Prediction of RNA-RNA interaction 

Initial methods to study the thermodynamics of multiple interacting strands concatenate input sequences 
in silico in some order and consider them as a single strand |[2][5j. Dirks et al. present a method, as a part 
of NUPack, that computes the partition function for the whole ensemble of complex species carefully 
considering symmetry, sequence multiplicities, and special pseudoknots |[T6ll . However, concatenating 
the sequences is not an accurate approach as even if pseudoknots are considered, some useful interac- 
tions are excluded while some physically impossible interactions are included. Some other methods 
simplify the problem by avoiding internal base-pairing in either strand, and compute the minimum free 
energy hybridization secondary structure [51 [n] [23] ED- A third group predict the secondary structure 
of each individual RNA independently, and predict the (most likely) hybridization between the unpaired 
regions of two molecules f71 l26l[39l . 

In addition, a number of studies aim to compute the minimum free energy joint structure between 
two interacting strands under more complex structure and energy models. Pervouchine devises a dy- 
namic programming algorithm to maximize the number of base pairs among interacting strands ll29l . 
Kato et al. propose a grammar based approach to RNA-RNA interaction prediction Ell . More gener- 
ally, Alkan et al. HI study the interaction secondary structure prediction problem under three different 
models: 1) base pair counting, 2) stacked pair energy model, and 3) loop energy model. Alkan et al. 
prove that the general RNA-RNA interaction prediction under all three energy models is an NP-hard 
problem. To reduce the complexity of the problem, they suggest some natural constraints on the con- 
sidered interaction secondary structures. These assumptions are satisfied by all examples of complex 
RNA-RNA interactions in the literature. The resulting algorithms efficiently compute the minimum free 
energy secondary structure among all possible joint secondary structures that do not contain (internal) 
pseudoknots, crossing interactions (i.e. external pseudoknots), and zigzags (please see section |2]for the 
exact definition). In our previous work, we give a dynamic programming algorithm to compute the 
partition function over the ensemble of such interaction secondary structures iflOl . 

2 Methods 

For the sake of completeness, we include here our notation and definitions given in iTTOl . Throughout 
this paper, we denote the two nucleic acid strands by R and S. Strand R is indexed from 1 to Lr, and S is 
indexed from 1 to L$ both in 5' to 3' direction. Note that the two strands interact in opposite directions, 
e.g. R in 5' —¥ 3' with S in 3' «— 5' direction. Each nucleotide is paired with at most one nucleotide in 
the same or the other strand. We refer to the i th nucleotide in R and S by ig and i$ respectively. The 
subsequence from the i th nucleotide to the j th nucleotide in a strand is denoted by 

An intramolecular base pair between the nucleotides i and j in a strand is called an arc and denoted 
by a bullet i • j. An intermolecular base pair between the nucleotides and is is called a bond and 
denoted by a circle in o i s . An arc ir • jr covers a bond Ir o k$ if /# < Ir < jr. We call ir • ]r an 
interaction arc if there is a bond Ir ok$ covered by • Jr. Assuming z# < ]r, two bonds z# o i s and 
Jr ° js are called crossing bonds if is < js- An interaction arc /# • jr in a strand subsumes a subsequence 
[is,js] i n the other strand if for all bonds Ir o/c 5 , if is < ks < js then < Ir < Jr. Two interaction arcs 
ir • Jr and 15 • js are part of a zigzag, if neither ir • Jr subsumes [is,js] nor is • js subsumes 
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In this paper, we assume there are no pseudoknots in individual secondary structures of R and S, 
and also there are no crossing bonds and zigzags between R and S. 

2.1 Base pair probabilities and centroid prediction 

To estimate the centroid of the Boltzmann ensemble, it is sufficient to calculate the base pair probabil- 
ities and select those base pairs whose probability is at least 0.5. In this section, we describe how to 
calculate the base pair probabilities. Our algorithm for base pair probabilities is based on our dynamic 
programming algorithm for the interaction partition function piRNA presented in ifTOl . 

Similar to piRNA, our algorithm for base pair probabilities is also a dynamic programming algorithm 
that computes two types of recursive quantities: 1) the probability of a subsequence in one strand, 
and 2) the probability of a joint subsequence pair [irJr] and [is,js]- A region is the domain over 
which a probability is computed. For the first type, region is and for the second type, region is 
[irJr] x [isJs]- Tn e length pair of region [i R ,j R ) x [i s ,js] is (Ir = Jr -i R + \,k = js~is+l)- Om 
algorithm starts with (Ir =Lr,1$ = As) an d considers all length pairs decrementally down to (Ir = 1,1$ = 
1). For a fixed length pair (Ir,Is), recursive quantities for all the regions [jr, i R + l R — 1] x [is, is + Is — 1] 
are computed. 




Figure 1: Cases of the interaction partition function Q\ r - r t j . Figures [3j|2] show the recursion for Q Ib 
and Q la where b stands for bond and a stands for arc. 

For brevity, we present only two recursions and briefly describe how to derive the rest. Let P 1 , P Ib , 
and P Ia be the probability of those substructures that constitute respectively Q 1 , Q Ih , and Q Ia in piRNA 
iflOl . Figure [TJ shows the cases of Q\ R j r is which is the interaction partition function for the region 
[iRijR] x [isJs]- A horizontal line indicates the phosphate backbone, a solid curved line indicates an 
arc, and a dashed curved line encloses a region and denotes its two terminal bases which may be paired 
or unpaired. Letter(s) within a region specify a recursive quantity. White regions are recursed over and 
blue regions indicate those portions of the secondary structure that are fixed at the current recursion level 
and contribute their energy to the partition function as defined by the energy model. A solid vertical 
line indicates a bond, a dashed vertical line denotes two terminal bases of a region which may be base 
paired or unpaired, and a dotted vertical line denotes two terminal bases of a region which are assumed 
to be unpaired. For the interaction partition functions, grey regions indicate a reference to the partition 
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Figure 2: Cases of Q\ a R j R ,- s j s for which we assume at least one of ir and js is the end point of an 
interaction arc. 



functions for the single sequences. 

The following equations precisely define the intended recursions: 

QkjRjsjs = QiR;j R Qis.js+ QiRM-iQh+i-jsQ'kujRhM^ H QtR.h-iQki+ijsQkijRjsM' 

•R<h<JR <R<kl<jR 



Q 



lb 

'rJrJsJs 



o Ihh ■ ■ + 

^■'R jR,lS,JS ~ 



>r<*i<Jr 

'S <k 2<Js 



Qlhb Qlb i 

^iR,k[-k 2 Js^ l kiJ R ,is 1 k 2 



•R< k l<JR 
•S <k 2<Js 



sylhh qIo 

^iR.k\ ,k 2 Js ^ki jR,r s ,k 2 ' 



(2) 



QirJrJsJs H Qi R ,ki,k 2 JsQki + l,j R J s ,k2-l ^ Qi R ,k l ,k 2 JsQk l + l,j R ,i s ,k 2 -l^~ 

IR< k t<jR 'R< k l<JR 

V n Ie n 1 

^i R ,ki ,k 2 J s **Zk\ + 1 J R Js ,k 2 - 1 ; 

'R< k t<jR 
•S <k 2<JS 

in which Q, Q Ihh , Q Ihb , Q h , Q Is ' , and Q Ie are partition functions defined in (TOB - Note that among all 
partition function recursions given in ifTOl . Q 1 appears on the right hand side of only OJ). Therefore, 

(n Is -\-0 Is> +o Ie )o I 

pi y pla ^k u i R Js,k 2 1 ^kiJ R J s ,k 2 1 ^■k l J R Js,k 2 J ^iRjR,'s,Js ^ 

i R jR,hJs ~ hjRhM qIci ' ' 

l < k i<>R ^k[J R ,i s ,k 2 

j s < k 2<Ls 

with P[ Lr y L = 1 as the initial condition. Also note that Q Ia appears on the right hand side of only CD 
and (f2l), hence, 

pla y p l Qkuk-l Qj s + 1 ,k 2 Qi R J R JsJs , y plb Qh,i R Js,k 2 QiRjR4sJs 

'rJrJsJs L kuiRdsM qI L hJ R As± 2 ^jb ' 

l< k l<'R "kiJ R ,is,k 2 1<*i<i> "k\J R ,is,k 2 

jS< k 2^ L S js< k 2^ L S 
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Figure 3: Recursion for <2' , .• assuming iro j s is a bond. 



Using the same technique, by considering the contribution of each right hand side term that contains 
the target partition function, all the probability recursions are derived; see lfT5l for more details of the 
technique. Finally, base pair probabilities are 
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--P° ■ + 

'RjR ~ 



le 



Lpls | p 

l<ki<k 2 <L s 

P(lS,js) = P t,j s + P kuhJs,js + HtteJsJs ' 



l<ki<k 2 <L R 



P(ix,js)= £ P iRM,k2js + L 



Q'kuiR ,j s ,kfi i R+ 1 ^ Qk2 Js - 1 



jlhh 



'RthfaJs' Z- ki,k 3 ,k 2 ,k4 

<R< k l< L R ><k,<i R <k 3 <L R ^k\,k iy k 2 ,h 

l<k 2 <j s l<k 2 <j S <k 4 <L s 



-Jhb 



52 1 ki,k3,k 2 ,k4 

i<k[<i R <k^<L R 
l<k 2 <j S <k 4 <L s 



Q'kliRjsM^'iR+^kj^js- 1 + &k\j s -\> 
Qlhb 

^ki,k 3 ,k 2 ,k4 



(6) 
(7) 



(8) 



Equation © consists of two types of terms: 1) the probability that ig o j s is on the left of a Q Ih compo- 
nent, which includes the cases where iroJ s is in the middle of a hybrid component (see Figure UJ), and 
2) the probability that ir o j s is on the right of a Q Ih component. 



ir 



JR 



III 



JS 



is 



Ih 



k 2 



Figure 4: Cases of Q\ R j R is - s the interaction partition function for a single hybrid component. 



2.2 Sampling algorithm 

In this section, we present an efficient algorithm to generate random samples from the Boltzmann ensem- 
ble of interaction structures. Each structure is drawn with probability equal to its Boltzmann probability. 
Let h = Lr and m = L$. A naive sampling algorithm, similar to the Ding-Lawrence algorithm fffl . has 
0(n 2 m 2 ) time complexity in our case. In this paper, we give an efficient algorithm, which is inspired by 
Ponty's boustrophedon method QUI , to improve the time complexity to 0((n + m) 2 \og{n + m)). 

Our algorithm is iterative conditioning-sampling, based on the Ding-Lawrence algorithm. It starts 
with Q\ n j m on top of an empty stack. In each step, our algorithm pops the top of the stack, which is 
a partition function term such as <2f" ; fi ; s j s - It selects a recursion case, such as the last case (rightmost) 
in Figure |2] and samples a pair of indices k\ (or a single index in the case of a single-strand partition 
function) with appropriate probability. For this example, the probability of indices k\ € (/r,7r],^2 G 
\is,js) in the last case of Q Ia is 

n(h,k 2 ) = ^^A+iAAA-i . (9) 

L <Rf\ <JR Ui Rj ki ,k 2 J s W*! + 1 J R ,i S ,k 2 - 1 
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Let %(i R , •) = Jt(-, js) = 0. To describe the naive approach first let 



V(v)= £ K(iR + (t mod (j R -i R + l)), i s + 

Q<t<v 



JR -lR + 1 



)■ 



(10) 



Figure |3a) shows how the two-dimensional array of indices is traversed in \|/. Note that © and ( fTOl ) 
imply that V|/((jj? — i R + l)(js ~~ J 's + 1)) = 1- To properly sample k^k^, our algorithm first generates a 
uniform random number a* £ [0, 1 + n(j R ,j s )). Let v* be such that \|/(v* ) < a* < \|/(v* + 1), and let 



k* = i R + (v* mod (j R - i R + 1)), 
v* 

k* 2 = i s + 



]R - IR + 1 



(ID 
(12) 



It is clear that ^ sampled according to 71 distribution in (0]). Finally, £2^,* ^* ; and Q{* +1 j R ig k *_ l 
are pushed onto the stack. The algorithm terminates whenever the stack is empty. No matter in which 
order the indices k\,k2 are inspected in \|/, it takes Oinin) time in the worst case to determine v*. There- 
fore, the worst case running time of this naive algorithm for a single sample structure is 0(n 2 m 2 ). 




Figure 5: (a) Naive traversal of indices, (b) Balanced traversal of indices. 



Our algorithm's speed-up comes from the following trick: let the worst case correspond to k\ = 
[{i R + j R )/2] and k\ = [(is + js)/2]. Using this trick, the problem is split in an (almost) balanced way 
in every step. More precisely, our algorithm uses the traversal scheme of Figure Ob) in \|/ instead of the 
scheme of Figure [5J a). In that case, the following lemma characterizes the cost of each step c(^,fe|) in 
the algorithm. 

Lemma 2.1 The cost of sampling k*,k^ in our scheme shown in Figure\5\b) satisfies 

c(kl k* 2 ) <2(mm(kt-i R J R -k$)+mm(kZ-i s ,js-k* 2 ) + 2) 2 . (13) 
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Complexity analysis Let f(n,m) denote the worst case running time of our sampling algorithm for 
two nucleic acids of length n and m. In that case, / satisfies the following recursive inequality 

f(n,m)<2f{-,-) + ± >-. (14) 

It follows from (PT4l) that f(n,m) is 0{{n + m) 2 \og{n + m)). Hence, the following theorem holds: 
Theorem 2.2 The worst case time complexity of our algorithm is 0{{n + m) 2 \og(n + m)). 

3 Results 

We implemented the centroid, base pair probabilities, and our sampling algorithms in the new version of 
piRNA which is implemented in C++ and is parallelized with OpenMP. Our experiments were run on an 
IBM shared memory machine with 64 PPC CPUs and 256GB of RAM. Using piRNA, we predicted the 
centroid for five interacting RNA pairs in Table [3] The longest experiment corresponds to OxyS-fhlA 
pair which took about 4 days. We used the exact centroid computed using the base pair probabilities in 
this study. In future work, we would like to explore Ding et al. approach which consists of sampling the 
ensemble and clustering samples. Centroids of the clusters are used as candidate structures instead of 
the exact centroid of the ensemble. 



RNA pairs 




Sensitivity 




PPV 




Reference 




piRNA 


inteRNA 


Kato et al. 


piRNA 


inteRNA 


Kato et al. 




Tar-Tar* 


1.0 


1.0 


1.0 


0.875 


0.875 


0.933 


11 


Rlinv-R2inv 


0.900 


1.0 


0.900 


0.900 


1.0 


0.947 


on 


DIS-DIS 


1.0 


0.785 


0.785 


1.0 


0.785 


0.785 


EH 


CopA-CopT 


1.0 


0.863 


0.909 


1.0 


0.760 


0.800 


L22J 


OxyS-fhlA 


0.714 






0.746 








Average 


0.922 


0.912 


0.898 


0.904 


0.855 


0.866 





Table 1: Comparison of the sensitivity and PPV of RNA-RNA interaction structure prediction by piRNA 
centroid prediction with those of inteRNA (2 and Kato et al. software lf2TTl . 



Table[3]summarizes the specificity and positive predictive value (PPV) of our RNA-RNA interaction 
structure prediction by centroid prediction. We considered Kato et al. dataset ||2~T1 excluding RepZ- 
IncRNA54 and including OxyS-fhlA. Due to limitation on the computational resources, we replaced 
RepZ-IncRNA54 with OxyS-fhlA as RepZ-IncRNA54 exhibits a CopA-CopT-like secondary structure 
whereas OxyS-fhlA has a different structure with two kissing hairpins. We expected centroid prediction 
to outperform minimum-free-energy prediction, and our expectation is verified. 

4 Conclusions and future work 

We presented base pair probabilities of interacting nucleic acids based on our previous interaction par- 
tition function algorithm piRNA iflOl . The centroid of the Boltzmann ensemble is computed from the 
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base pair probabilities. We also presented an efficient algorithm to sample interaction structures from 
the ensemble. Our sampling algorithm uses a balanced scheme for traversing indices (depicted in Figure 
Ob)). The worst case running time complexity of our algorithm is 0((n + m) 2 \og(n + m)), in which n 
and m are the lengths of input strands. These algorithms are incorporated in the new version of piRNA. 

In future work, we would like to explore Ding et al. approach which consists of sampling the 
ensemble and clustering samples. Centroids of the clusters are used as candidate structures instead of 
the exact centroid of the ensemble. We believe success of such an approach critically depends on the 
clustering method, therefore, we would like to study sampling-clustering algorithms in future work. 

Acknowledgement H. Chitsaz received funding from Combating Infectious Diseases (BCID) initia- 
tive. S.C. Sahinalp was supported by Michael Smith Foundation for Health Research Career Award. 
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